Week 5: Classification techniques
The University of Sydney
Important
- e1071::svm (Support Vector Machines)
- stats::glm (Logistic regression via Generalized Linear Models)
- class::knn (k-NN classifier)
- MASS::lda (Linear Discriminant Analysis)

This presentation is based on the SOLES reveal.js Quarto template and is licensed under a Creative Commons Attribution 4.0 International License.
\text{MSE} = \mathbb{E}[(Y - \hat{f}(X))^2]
\text{MSE} = \mathbb{E}[(y_0 - \hat{f}(x_0))^2] = \text{Bias}^2 + \text{Var}(\hat{f}(x_0)) + \text{Var}(\varepsilon)
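This decomposition can be checked numerically. The sketch below is a minimal Python simulation (the quadratic true function, the deliberately biased constant-mean "model", and all numeric values are illustrative assumptions, not from the slides): it redraws many training sets, records the prediction at a fixed test point x_0, and compares the Monte Carlo MSE against bias² + variance + noise.

```python
import random
import statistics

random.seed(0)
sigma = 0.5                      # noise sd, so Var(eps) = sigma**2
f = lambda x: x * x              # assumed true regression function
x0, n_train, reps = 0.8, 30, 4000

preds, sq_errs = [], []
for _ in range(reps):
    # Fresh training set on each repetition
    xs = [random.uniform(0, 1) for _ in range(n_train)]
    ys = [f(x) + random.gauss(0, sigma) for x in xs]
    y_hat = statistics.fmean(ys)            # constant model: biased, low variance
    y0 = f(x0) + random.gauss(0, sigma)     # fresh test response at x0
    preds.append(y_hat)
    sq_errs.append((y0 - y_hat) ** 2)

mse = statistics.fmean(sq_errs)
bias2 = (statistics.fmean(preds) - f(x0)) ** 2
var = statistics.pvariance(preds)
# Up to Monte Carlo error: mse is approximately bias2 + var + sigma**2
```

The constant model pays a large squared-bias term at x_0 but has small variance; a more flexible model would trade in the opposite direction.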
Bias and Variance in Machine Learning
Each Observation Consists of:
Class Label y – categorical variable
Feature Vector \boldsymbol{x} = (x_1, x_2, \ldots, x_p) – mix of categorical and continuous variables
Goal:
To classify y using \boldsymbol{x}
Available data: \{\boldsymbol{x}_i,y_i\}_{i=1}^n
Classification models produce a continuous-valued prediction, usually in the form of a probability (i.e., the predicted class-membership values for any individual sample are between 0 and 1 and sum to 1).
Require a rule to assign observations to a category based on the predicted probability.
Clustering: classes are unknown, want to discover them from the data (unsupervised)
Classification: classes are predefined, want to use a training set of labeled objects to form a classifier for classification of future observations (supervised)
Binary in that there are two possible values (0 or 1, TRUE or FALSE)
Examples of binary classification:
Labels are similarly described, y \in \left\{ 0, 1\right\}
Aim is to predict the tumour type — whether it is benign (non-cancer) or malignant (cancer) — based on tumour size, thickness, and other clinical or imaging characteristics.
Need: a model that outputs the probability of each tumour type. Using a decision rule, a tumour is classified as malignant if P(Malignant | features) > 0.5; otherwise, it is assigned as benign.
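A minimal sketch of this decision rule in Python (the probabilities below are made-up illustrative values, not real model output):

```python
# Hypothetical predicted probabilities of malignancy for three tumours
p_malignant = [0.10, 0.72, 0.45]

def assign(p, threshold=0.5):
    # Classify as malignant when P(Malignant | features) exceeds the threshold
    return "malignant" if p > threshold else "benign"

labels = [assign(p) for p in p_malignant]
# labels == ["benign", "malignant", "benign"]
```

The threshold 0.5 is the default choice; it can be moved (e.g., lowered) when the two kinds of misclassification carry different costs.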
LDA classifies data by finding the decision boundary that best separates different classes, assuming the data follow Gaussian distributions with a class-specific mean (\mu_k) and a common variance (\sigma^2).
\begin{align*} p_k(x) = \color{red}{P(Y = k| X = x)} = \frac{\color{blue}{\pi_k} f_k(x)}{\sum_{\ell = 1}^K\pi_\ell f_\ell(x)} \end{align*}
Posterior: The probability of classifying an observation into group k given that it has features x
Prior: The prior probability of an observation in general belonging to group k
f_k(x) is the density function for feature x given that it is in group k
Using Bayes’ theorem, we model
\begin{align*} p_k(x) &= P(Y = k| X = x) = \frac{\pi_k f_k(x)}{\sum_{\ell = 1}^K\pi_\ell f_\ell(x)}\\ \pi_k&: \text{Probability of coming from class }k \text{ (prior probability)}\\ f_k(x)&: \text{Density function for }X\text{ given that }X\text{ is an observation from class }k \end{align*}
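A sketch of this posterior calculation in Python, assuming (as LDA does) normal densities f_k with class-specific means and a common variance; the parameter values here are illustrative assumptions, not fitted estimates:

```python
import math

def normal_pdf(x, mu, sigma2):
    # Density of N(mu, sigma2) evaluated at x
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Illustrative parameters: two classes, equal priors, common variance
params = {"A": {"pi": 0.5, "mu": 1.0}, "B": {"pi": 0.5, "mu": 3.0}}
sigma2 = 0.5

def posterior(x):
    # Bayes' theorem: p_k(x) = pi_k * f_k(x) / sum_l pi_l * f_l(x)
    num = {k: p["pi"] * normal_pdf(x, p["mu"], sigma2) for k, p in params.items()}
    total = sum(num.values())
    return {k: v / total for k, v in num.items()}

posterior(2.0)  # {"A": 0.5, "B": 0.5}: x = 2 sits midway between the class means
```

Moving x towards a class mean pushes that class's posterior towards 1, which is exactly how the Bayes classifier assigns labels.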
\begin{align*} \widehat{\mu}_k &= \frac{1}{n_k} \sum_{i:y_i = k} x_i\\ \widehat{\sigma}^2 &= \frac{1}{n - K} \sum_{k = 1}^K \sum_{i:y_i = k} (x_i - \widehat{\mu}_k)^2\\ \widehat{\pi}_k &= n_k/n \end{align*}
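These plug-in estimates are simple to compute directly. A minimal Python sketch on made-up one-dimensional data (the values of x and y are assumptions for illustration):

```python
import statistics
from collections import defaultdict

# Toy single-predictor data with class labels
x = [1.0, 1.2, 0.8, 3.1, 2.9, 3.0]
y = ["A", "A", "A", "B", "B", "B"]
n, K = len(x), len(set(y))

groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[yi].append(xi)

mu_hat = {k: statistics.fmean(v) for k, v in groups.items()}   # class means
pi_hat = {k: len(v) / n for k, v in groups.items()}            # priors n_k / n
# Pooled variance estimate, dividing by n - K as in the formula above
sigma2_hat = sum((xi - mu_hat[k]) ** 2
                 for k, v in groups.items() for xi in v) / (n - K)
# mu_hat is {"A": 1.0, "B": 3.0}; sigma2_hat is 0.025; both priors are 0.5
```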
In the case where n is small and the distribution of the predictors X is approximately normal, LDA is more stable than logistic regression.
LDA is more popular when we have more than two response classes, as class assignment is more intuitive to predict. However, logistic regression CAN be generalised to classification problems with more than two classes.
When classes are perfectly separated, the parameter estimates in logistic regression become infinite. This can lead to instability in software (e.g., R) when estimating parameters. In contrast, LDA remains stable and does not face this issue.
LDA may perform poorly for categorical predictors due to the normality assumption.
Similarity:
Differences:
LDA assumes that the observations are drawn from a normal distribution with a common variance in each class, while logistic regression does not make this assumption
LDA will do better than logistic regression if the assumption of normality holds; otherwise, logistic regression may outperform LDA
kNN is completely non-parametric: No assumptions are made about the shape of the decision boundary
We can expect kNN to dominate both LDA and Logistic Regression when the decision boundary is highly non-linear
kNN does not tell us which predictors are important (no table of coefficients)
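As a concrete illustration, kNN can be sketched in a few lines of Python (the toy data and the choice k = 3 are assumptions for the example):

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, x_new, k=3):
    # Distance from x_new to every training point, paired with its label
    neighbours = sorted(
        (math.dist(xi, x_new), yi) for xi, yi in zip(train_x, train_y)
    )
    # Majority vote among the k nearest neighbours
    votes = Counter(yi for _, yi in neighbours[:k])
    return votes.most_common(1)[0][0]

train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["blue", "blue", "blue", "red", "red", "red"]
knn_predict(train_x, train_y, (0.5, 0.5))  # "blue"
```

Consistent with the point above, there are no coefficients anywhere: the prediction is purely a vote among nearby training points, so kNN offers no table of predictor importance.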
The support vector machine finds a decision boundary (hyperplane) that best separates the two classes by maximising the margin between them.
If \beta_0 = 0, the hyperplane passes through the origin, otherwise it does not
In p = 2 dimensions, the hyperplane is a line
\begin{align*} \max_{\beta_0, \beta_1, \beta_2, \ldots, \beta_p} M \quad \text{such that} & \quad \sum_{j = {\color{red}1}}^p \beta_j^2 = 1\\ & \quad y_i ( \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots +\beta_p x_{ip}) \ge M,~i=1,\ldots,n \end{align*}
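The constraint side of this problem can be checked numerically: for a candidate (\beta_0, \beta) with \sum_j \beta_j^2 = 1, the margin achieved is the smallest value of y_i f(x_i). A Python sketch on assumed toy data (the hyperplane below is hand-picked for illustration, not the solution of the optimisation):

```python
import math

# Toy separable data, labels coded as +1 / -1
X = [(1, 1), (2, 2), (-1, -1), (-2, -2)]
y = [1, 1, -1, -1]

# Candidate hyperplane with unit-norm coefficients
beta0 = 0.0
beta = (1 / math.sqrt(2), 1 / math.sqrt(2))
assert math.isclose(sum(b * b for b in beta), 1.0)  # sum of beta_j^2 equals 1

def f(x):
    return beta0 + sum(b * xi for b, xi in zip(beta, x))

# Margin achieved by this hyperplane: the smallest signed distance y_i * f(x_i),
# attained here by the closest points (1, 1) and (-1, -1)
M = min(yi * f(xi) for xi, yi in zip(X, y))  # sqrt(2)
```

The optimisation searches over all unit-norm (\beta_0, \beta) for the hyperplane making this smallest signed distance M as large as possible.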
What are support vectors?
A. all the points on the graph.
B. the points that are closest to the decision boundary (dashed lines).
C. the red and the blue point line on the dashed lines.
If a perfectly separating hyperplane is not possible (because the classes overlap) or not desirable (because of outliers):
Observations allowed in the margin
Observations on the correct side of hyperplane
Support Vector Classifier solves the following optimization problem: \begin{align*} \max_{\beta_0, \beta_1, \beta_2, \ldots, \beta_p, \epsilon_1, \ldots, \epsilon_n} M \quad \text{such that} & \quad \sum_{j = 1}^p \beta_j^2 = 1\\ & \quad y_i ( \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots +\beta_p x_{ip}) \ge M(1 - \epsilon_i),~i=1,\ldots,n\\ & \quad \epsilon_i \ge 0,~i=1,\ldots,n, \quad \sum_{i = 1}^n \epsilon_i \le C \end{align*}
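The slack variables \epsilon_i can be read off directly from the constraint: rearranging y_i f(x_i) \ge M(1 - \epsilon_i) gives \epsilon_i = \max(0, 1 - y_i f(x_i)/M). A Python sketch with a hand-picked hyperplane and target margin (illustrative assumptions, not fitted values):

```python
# Assumed hyperplane f(x) = x1 (beta0 = 0, beta = (1, 0)), target margin M = 1
beta0, beta, M = 0.0, (1.0, 0.0), 1.0

pts = [((2.0, 0.0), 1),    # well outside the margin        -> eps = 0
       ((0.5, 0.0), 1),    # inside the margin              -> 0 < eps < 1
       ((-0.5, 0.0), 1)]   # wrong side of the hyperplane   -> eps > 1

def slack(x, yi):
    fx = beta0 + sum(b * v for b, v in zip(beta, x))
    # Smallest eps >= 0 satisfying y_i * f(x_i) >= M * (1 - eps)
    return max(0.0, 1 - yi * fx / M)

eps = [slack(x, yi) for x, yi in pts]  # [0.0, 0.5, 1.5]
# The budget C caps the total slack: here sum(eps) == 2.0
```

A point with 0 < \epsilon_i < 1 violates the margin but is still correctly classified; \epsilon_i > 1 means the point is on the wrong side of the hyperplane. Larger C tolerates more such violations.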
Original data:
Construct a new feature Dosage^2 using the original data:
Enlarge the space of features by including transformations:
Fit (linear) support vector classifier in the expanded feature space
Example: Suppose we start off in 2-dimensional feature space (X_1, X_2)
This leads to non-linear decision boundary in the original space (quadratic conic sections)
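A minimal sketch of this idea in Python, with coefficients chosen by hand for illustration (not fitted): a classifier that is linear in the enlarged features (X_1, X_2, X_1^2, X_2^2, X_1 X_2) traces out the unit circle as its boundary in the original space.

```python
def expand(x1, x2):
    # Enlarged feature space: original features plus squares and interaction
    return (x1, x2, x1 * x1, x2 * x2, x1 * x2)

# Hand-picked LINEAR coefficients in the enlarged space:
# f(x) = X1^2 + X2^2 - 1, so the induced boundary is the unit circle
w, b = (0.0, 0.0, 1.0, 1.0, 0.0), -1.0

def f(x1, x2):
    return b + sum(wi * zi for wi, zi in zip(w, expand(x1, x2)))

f(0.1, 0.1)  # negative: inside the circle  -> one class
f(2.0, 0.0)  # positive: outside the circle -> the other class
```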
A nice illustration on YouTube
Suppose now we replace the inner product with a generalized function of the form \begin{align*} K(\boldsymbol{x}_i, \boldsymbol{x}_j) \end{align*} This function is called a kernel
In this context, the kernel quantifies the similarity of two observations
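For example, the widely used radial basis kernel measures similarity through squared distance; a minimal Python sketch (the choice of gamma is an assumption):

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    # Similarity is 1 for identical points and decays towards 0 as they move apart
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

rbf_kernel((0, 0), (0, 0))  # 1.0: maximal similarity
rbf_kernel((0, 0), (3, 4))  # exp(-25): essentially dissimilar
```

Replacing inner products with such a kernel lets the support vector classifier act as if it were working in a much richer (even infinite-dimensional) feature space without ever computing the features explicitly.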